Conversation

@singankit
Contributor

@singankit singankit commented Jul 24, 2025

Fixes #

Changes

This PR proposes a way to capture evaluation results for GenAI applications.

Prototype: https://github.com/singankit/evaluation_results

Note: if the PR is touching an area that is not listed in the existing areas, or the area does not have sufficient domain experts coverage, the PR might be tagged as experts needed and move slowly until experts are identified.

Merge requirement checklist

  • CONTRIBUTING.md guidelines followed.
  • Change log entry added, according to the guidelines in When to add a changelog entry.
    • If your PR does not need a change log, start the PR title with [chore]
  • Links to the prototypes or existing instrumentations (when adding or changing conventions)

Member

@lmolkova lmolkova left a comment


Are there publicly available prototypes of the code emitting evaluation results? Please link them in the PR description

@singankit singankit marked this pull request as ready for review August 5, 2025 20:16
@singankit singankit requested review from a team as code owners August 5, 2025 20:16
@singankit singankit changed the title from Gen AI Evaluation Event to Gen AI Evaluation Result Aug 5, 2025
@github-actions github-actions bot added the enhancement and area:gen-ai labels Aug 5, 2025
@dmontagu

dmontagu commented Aug 6, 2025

One issue I have with creating a separate span for tracking each evaluation score is that it makes it harder (at least with the way we index spans...) to write at least a couple of classes of queries for specific cases:

  • Find all cases where the total number of tokens for the task is above a threshold and a score is below a threshold (or vice versa)
    • This would require comparing information about token usage (or some other attribute, in the abstract) from the task execution span with the evaluation score from the evaluation score span
  • Find all cases where score A is above a threshold and score B is below a threshold
    • This would require comparing two different evaluation score spans

It would work way better for us if there was a way that we could, while complying with the semantic conventions, put all this information as attributes of a single span so that we can query it at once. I guess we can do that in addition to complying with the semantic convention, this just sticks out to me as an unfortunate aspect of this design.
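To make that concrete, here is a rough sketch of the single-span shape I have in mind, assuming the OpenTelemetry Python SDK; the per-score attribute keys are invented purely for illustration and are not part of this proposal:

# Hypothetical sketch: record evaluation scores as attributes on the task span
# itself, so token usage and scores can be filtered in a single span query.
# The "gen_ai.evaluation.score.<name>" keys below are made up for illustration.
from opentelemetry import trace

tracer = trace.get_tracer("single-span-demo")

with tracer.start_as_current_span("chat gpt-4o") as span:
    # ... execute the task and record the usual gen_ai.* attributes ...
    span.set_attribute("gen_ai.usage.input_tokens", 15)
    span.set_attribute("gen_ai.usage.output_tokens", 114)
    # Evaluation scores computed before the span ends, keyed by evaluation name:
    span.set_attribute("gen_ai.evaluation.score.relevance", 4)
    span.set_attribute("gen_ai.evaluation.score.groundedness", 5)

With that shape, both classes of queries above become single-span attribute filters, at the cost of requiring the scores to exist before the span ends.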

@singankit
Contributor Author


Thank you for your detailed feedback. I agree that having all relevant metrics as attributes on a single span would simplify querying and analysis; however, I'm not fully sure this approach is flexible enough to cover the various scenarios, especially asynchronous evaluations.

To clarify with a concrete scenario:

  • On Day 1, I select 3 evaluation metrics for my GenAI application.
  • On Day 5, I realize that 3 metrics are not sufficient and decide to add 2 more, bringing the total to 5 metrics.

If I need to retroactively compute the 2 new metrics on existing traces that already contain the original 3 metrics, would the recommended approach be to generate new spans for these additional metrics? Or is there a preferred way to update the original span with the new evaluation results, while still adhering to the semantic conventions?

A similar situation could arise if some evaluation metrics are computed asynchronously or by a downstream service at different times. In such cases, would each metric (or set of metrics computed together) necessarily require a separate span, or is there flexibility to consolidate them as attributes on the original span?

Thanks again for your insights. I’m keen to understand the best practices for handling evolving needs of asynchronous evaluation workflows in line with the semantic conventions.

  • Would evaluation scores as events help in this case? If so, I can start a new issue to discuss whether evaluations should be emitted as events in addition to spans.
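If "events" here is read as span events, a minimal sketch (assuming the OpenTelemetry Python SDK and reusing the attribute names proposed in this PR; the event name itself is hypothetical) could look like the following. Note that span events can only be added while the span is still recording, which is exactly where the asynchronous case gets awkward:

from opentelemetry import trace

tracer = trace.get_tracer("evaluation-events-demo")

with tracer.start_as_current_span("chat gpt-4o") as span:
    # ... execute the task ...
    # Attach an evaluation result as a span event. This must happen before the
    # span ends, so it does not cover evaluations computed later or downstream.
    span.add_event(
        "gen_ai.evaluation",  # hypothetical event name
        attributes={
            "gen_ai.evaluation.name": "relevance",
            "gen_ai.evaluation.score": 4,
            "gen_ai.evaluation.label": "Pass",
        },
    )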

@dmontagu
Copy link

dmontagu commented Aug 6, 2025

If I need to retroactively compute the 2 new metrics on existing traces that already contain the original 3 metrics, would the recommended approach be to generate new spans for these additional metrics? Or is there a preferred way to update the original span with the new evaluation results, while still adhering to the semantic conventions?

For the most part, I typically think of traces as static/immutable after their root span closes (I know that's not technically the case, and there are reasonable scenarios where that is explicitly avoided, but still). I think we should just stick to that mental model here — if you want to "mutate" an evaluation run/experiment/whatever-you-want-to-call-it, my personal feeling is that that should happen in some application-layer logic, and just rely on OTel for a static record of what happened when it happened. For example, you could create a new evaluation run where you just copy the old run's outputs for the metrics you've already computed, and compute new ones where you want.

I would personally have no problem just generating a new trace for the updated evaluation results (even if that meant copying execution data from an older trace or otherwise referencing it via span links or something), I don't need to extend an old trace. Maybe others feel differently, but just sharing my opinion.

I'll note that the idea of adding additional metrics also feels somewhat awkward because, while I can always add new metrics to an old trace, what about redefining an existing metric? Presumably that opens more of a can of worms about what it means to "overwrite" old spans? I personally feel it's better to avoid the whole issue by not encouraging this pattern. Just my 2c

@singankit
Contributor Author


The proposal in this PR is consistent with your feedback on treating traces as static/immutable. It keeps the span being evaluated untouched and instead adds a link on the evaluation span that identifies the span being evaluated, as the example below shows. This leaves the flexibility to add more evaluations later as needed.

Span Being Evaluated:

{
    "name": "chat gpt-4o",
    "context": {
        "trace_id": "0xeb0fdf5670975fea194b2eef13e789c6",
        "span_id": "0x63e929946253cf52",
        "trace_state": "[]"
    },
    "kind": "SpanKind.CLIENT",
    "parent_id": null,
    "start_time": "2025-08-04T23:45:55.959766Z",
    "end_time": "2025-08-04T23:45:57.512342Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.system": "openai",
        "gen_ai.request.model": "gpt-4o",
        "server.address": "anksing1rpeastus2.openai.azure.com",
        "_MS.sampleRate": 100.0,
        "gen_ai.response.model": "gpt-4o-2024-11-20",
        "gen_ai.response.finish_reasons": [
            "stop"
        ],
        "gen_ai.response.id": "chatcmpl-C0zAK9AqI5MpwVScha6p1Dm5BgO7R",
        "gen_ai.usage.input_tokens": 15,
        "gen_ai.usage.output_tokens": 114
    },
    "events": [],
    "links": [],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.36.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}

Evaluation Span:

{
    "name": "evaluation relevance",
    "context": {
        "trace_id": "0x6ebb9835f43af1552f2cebb9f5165e39",
        "span_id": "0x89829115c2128845",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2025-08-04T23:48:35.592833Z",
    "end_time": "2025-08-04T23:48:35.592833Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "_MS.sampleRate": 100.0,
        "gen_ai.operation.name": "evaluation",
        "gen_ai.evaluation.name": "relevance",
        "gen_ai.evaluation.score": 4,
        "gen_ai.evaluation.label": "Pass",
        "gen_ai.evaluation.reasoning": "Response is relevant to the query."
    },
    "events": [],
    "links": [ // Added Links
        {
            "context": {
                "trace_id": "0xeb0fdf5670975fea194b2eef13e789c6",
                "span_id": "0x63e929946253cf52",
                "trace_state": "[]"
            },
            "attributes": {
                "gen_ai.operation.name": "evaluation"
            }
        }
    ],
    "resource": {
        "attributes": {
            "telemetry.sdk.language": "python",
            "telemetry.sdk.name": "opentelemetry",
            "telemetry.sdk.version": "1.36.0",
            "service.name": "unknown_service"
        },
        "schema_url": ""
    }
}
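
For completeness, a minimal sketch of emitting such a linked evaluation span with the OpenTelemetry Python SDK (the helper name and the way the evaluated span's context is rebuilt from stored ids are assumptions for illustration):

from opentelemetry import trace
from opentelemetry.trace import Link, SpanContext, SpanKind, TraceFlags

tracer = trace.get_tracer("evaluation-demo")

def emit_evaluation_span(evaluated: SpanContext, name: str, score: float,
                         label: str, reasoning: str) -> None:
    # Hypothetical helper: records one evaluation result as its own span and
    # links it back to the span being evaluated, per the proposal above.
    link = Link(evaluated, attributes={"gen_ai.operation.name": "evaluation"})
    with tracer.start_as_current_span(
        f"evaluation {name}", kind=SpanKind.INTERNAL, links=[link]
    ) as span:
        span.set_attribute("gen_ai.operation.name", "evaluation")
        span.set_attribute("gen_ai.evaluation.name", name)
        span.set_attribute("gen_ai.evaluation.score", score)
        span.set_attribute("gen_ai.evaluation.label", label)
        span.set_attribute("gen_ai.evaluation.reasoning", reasoning)

# The evaluated span's context can be captured at request time or rebuilt later
# from stored trace/span ids (values taken from the example above):
evaluated_ctx = SpanContext(
    trace_id=0xEB0FDF5670975FEA194B2EEF13E789C6,
    span_id=0x63E929946253CF52,
    is_remote=True,
    trace_flags=TraceFlags(TraceFlags.SAMPLED),
)
emit_evaluation_span(evaluated_ctx, "relevance", 4, "Pass",
                     "Response is relevant to the query.")

Because the evaluation span lives in its own trace and only references the original span via a link, new evaluations (or re-runs of an updated metric) can be added at any later time without touching the original trace.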
      

what about redefining an existing metric?

  • Would this be a good candidate for metric versioning? For version 2, there could be a new evaluation span that links to the span being evaluated.

I appreciate you bringing up these scenarios and use cases; it will help shape this work. :)

@zhirafovod

Looks good to me overall

@github-project-automation github-project-automation bot moved this from Untriaged to Needs More Approval in Semantic Conventions Triage Aug 26, 2025
@singankit
Contributor Author

Thank you all for the valuable feedback and thoughtful discussion that helped bring this PR to a merge-ready state.
@lmolkova, could you please proceed with merging the PR at your convenience? It has received the required approvals.

@lmolkova lmolkova added this pull request to the merge queue Aug 26, 2025
Merged via the queue into open-telemetry:main with commit ebbf315 Aug 26, 2025
15 checks passed
